Sources¶
- https://github.com/eclarson/MachineLearningNotebooks/blob/master/04.%20Dimension%20Reduction%20and%20Images.ipynb
- ChatGPT (For formatting text and plots)
Business Understanding (2 pts)¶
Overview of the Dataset and Its Purpose
The Pistachio Image Dataset (Kaggle link) is a carefully curated collection of images featuring two distinct pistachio varieties: Kirmizi and Siirt. These varieties differ in both appearance and nutritional profile. Kirmizi pistachios are known for their vibrant color and bold flavor, while Siirt pistachios are valued for their smoother taste and slightly different protein content. For example, Acar and Eti (2011) report protein ranges of 18.23–19.09% for Kirmizi and 17.51–18.08% for Siirt, highlighting subtle but meaningful nutritional differences.
The purpose of the dataset is twofold: (1) to enable the development of an advanced image classification model that can automatically distinguish between these varieties, and (2) to support both processors and consumers in the pistachio industry. Processors benefit from automated sorting and precise labeling, which enhances efficiency and product credibility. Consumers, on the other hand, gain confidence in identifying the pistachio variety that best suits their dietary needs or flavor preferences.
Prediction Task
For our project, the prediction task is to build an image classification model that can accurately differentiate between Kirmizi and Siirt pistachios. Unlike regression-based tasks, this is a binary classification problem, where each pistachio image is assigned to one of the two categories. The state-of-the-art accuracy for this dataset currently stands at 99.89%, and our objective is to meet or exceed this benchmark. Achieving such high precision is critical because even small misclassifications could reduce trust in the model for commercial use, especially in automated processing pipelines.
Why This Matters and Performance Expectations
The results of this classification task are meaningful for multiple stakeholders. For processors, high accuracy ensures consistent quality control and reduces human error in sorting. For consumers, particularly those with dietary goals (e.g., athletes monitoring protein intake), the ability to reliably identify pistachio varieties supports informed food choices. Beyond these practical applications, this project also showcases the potential of machine learning in agriculture, where advanced models can streamline product handling and improve consumer knowledge.
By aiming for above 99.89% accuracy, we push the boundaries of agricultural image classification, striving to deliver a model that is both reliable and industry-competitive. In doing so, the Pistachio Image Dataset becomes more than just a collection of images—it becomes a gateway to smarter agricultural practices and better consumer transparency.
Data Preparation (1 pts)¶
Part One (0.5 pts)¶
Read in your images as numpy arrays. Resize and recolor images as necessary.
import os
import cv2
import numpy as np
dataset_path = "../dataset/Pistachio_Image_Dataset"
img_size = (128, 128)
images = []
labels = []
for variety in os.listdir(dataset_path):
    variety_path = os.path.join(dataset_path, variety)
    if not os.path.isdir(variety_path):
        continue
    for filename in os.listdir(variety_path):
        file_path = os.path.join(variety_path, filename)
        img = cv2.imread(file_path, cv2.IMREAD_GRAYSCALE)
        if img is None:
            continue
        img_resized = cv2.resize(img, img_size)
        img_array = img_resized.astype("float32") / 255.0
        images.append(img_array)
        labels.append(variety)
images = np.array(images)
labels = np.array(labels)
print(images.shape)
print(labels.shape)
(2148, 128, 128)
(2148,)
Part Two (0.4 pts)¶
Linearize the images to create a table of 1-D image features (each row should be one image).
import pandas as pd
# Create a table of 2148 rows, where each row contains 16385 features (16384 pixels + 1 label)
X = images.reshape(images.shape[0], -1)
y = labels
pixel_cols = [f"pixel_{i}" for i in range(X.shape[1])]
df = pd.DataFrame(X, columns=pixel_cols)
df["label"] = y
print(df.shape)
print(df.head())
(2148, 16385)
pixel_0 pixel_1 pixel_2 pixel_3 pixel_4 pixel_5 pixel_6 pixel_7 \
0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
pixel_8 pixel_9 ... pixel_16375 pixel_16376 pixel_16377 pixel_16378 \
0 0.0 0.0 ... 0.0 0.0 0.0 0.0
1 0.0 0.0 ... 0.0 0.0 0.0 0.0
2 0.0 0.0 ... 0.0 0.0 0.0 0.0
3 0.0 0.0 ... 0.0 0.0 0.0 0.0
4 0.0 0.0 ... 0.0 0.0 0.0 0.0
pixel_16379 pixel_16380 pixel_16381 pixel_16382 pixel_16383 \
0 0.0 0.0 0.0 0.0 0.0
1 0.0 0.0 0.0 0.0 0.0
2 0.0 0.0 0.0 0.0 0.0
3 0.0 0.0 0.0 0.0 0.0
4 0.0 0.0 0.0 0.0 0.0
label
0 Siirt_Pistachio
1 Siirt_Pistachio
2 Siirt_Pistachio
3 Siirt_Pistachio
4 Siirt_Pistachio
[5 rows x 16385 columns]
Part Three (0.1 pts)¶
Visualize several images.
import matplotlib.pyplot as plt
# plot function from Dr. Larson
def plot_gallery(images, titles, h, w, n_row=3, n_col=6):
    plt.figure(figsize=(1.7 * n_col, 2.3 * n_row))
    plt.subplots_adjust(bottom=0, left=.01, right=.99, top=.90, hspace=.35)
    for i in range(n_row * n_col):
        plt.subplot(n_row, n_col, i + 1)
        plt.imshow(images[i].reshape((h, w)), cmap=plt.cm.gray)
        plt.title(titles[i], size=12)
        plt.xticks(())
        plt.yticks(())
plot_gallery(images, labels, 128, 128, 1, 5)
Data Reduction (6 pts)¶
Part One (0.5 pts)¶
Perform linear dimensionality reduction of the images using principal components analysis. Visualize the explained variance of each component. Analyze how many dimensions are required to adequately represent your image data. Explain your analysis and conclusion.
from sklearn.decomposition import PCA
# plot function from Dr. Larson
def plot_explained_variance(pca):
    import plotly
    from plotly.graph_objs import Bar, Line
    from plotly.graph_objs import Scatter, Layout
    from plotly.graph_objs.scatter import Marker
    from plotly.graph_objs.layout import XAxis, YAxis
    plotly.offline.init_notebook_mode()  # run at the start of every notebook
    explained_var = pca.explained_variance_ratio_
    cum_var_exp = np.cumsum(explained_var)
    plotly.offline.iplot({
        "data": [Bar(y=explained_var, name='individual explained variance'),
                 Scatter(y=cum_var_exp, name='cumulative explained variance')],
        "layout": Layout(xaxis=XAxis(title='Principal components'),
                         yaxis=YAxis(title='Explained variance ratio'))
    })
pca = PCA(n_components=325).fit(X)
plot_explained_variance(pca)
plt.show()
print("Figure 1: Explained variance of principal components using standard PCA\n")
print("Number of components chosen:", pca.n_components_)
print("Cumulative explained variance:", sum(pca.explained_variance_ratio_))
Figure 1: Explained variance of principal components using standard PCA

Number of components chosen: 325
Cumulative explained variance: 0.9506371
By a widely used rule of thumb, retaining around 95% of the variance is generally considered sufficient to adequately represent image data while filtering out minor noise. Based on the explained variance plot in Figure 1, the cumulative curve reaches this threshold at about 325 components (dimensions). This means that instead of relying on all 16,384 original pixel features, the pistachio images can be effectively represented in a much smaller feature space of 325 dimensions, while still preserving nearly all of the important structural information.
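As a quick cross-check on this threshold, scikit-learn can select the component count for a variance target directly: passing a float in (0, 1) as `n_components` keeps just enough components to reach that fraction of cumulative explained variance. A minimal sketch on synthetic data (a stand-in, not the pistachio matrix `X`):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for an image feature matrix (200 samples, 50 features)
rng = np.random.default_rng(0)
data = rng.normal(size=(200, 50)) @ rng.normal(size=(50, 50))

# A float n_components asks PCA for the smallest number of components
# whose cumulative explained variance reaches that fraction
pca_95 = PCA(n_components=0.95).fit(data)
cum_var = np.cumsum(pca_95.explained_variance_ratio_)
print(pca_95.n_components_, cum_var[-1])
```

On the actual 2148 x 16384 pixel matrix, `PCA(n_components=0.95)` should land at roughly the same 325 components found by reading the cumulative curve.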
Part Two (0.5 pts)¶
Perform linear dimensionality reduction of your image data using randomized principal components analysis. Visualize the explained variance of each component. Analyze how many dimensions are required to adequately represent your image data. Explain your analysis and conclusion.
rpca = PCA(n_components=325, svd_solver='randomized').fit(X)
plot_explained_variance(rpca)
plt.show()
print("Figure 2: Explained variance of principal components using Randomized PCA\n")
print("Number of components chosen:", rpca.n_components_)
print("Cumulative explained variance:", sum(rpca.explained_variance_ratio_))
Figure 2: Explained variance of principal components using Randomized PCA

Number of components chosen: 325
Cumulative explained variance: 0.9506435
The results from Randomized PCA closely mirror those of standard PCA. Similar to the conventional approach, Randomized PCA also needs around 325 components (dimensions) to capture about 95% of the variance—a commonly accepted benchmark for adequately representing image data. This shows that both methods provide nearly the same representational quality for the dataset.
Part Three (2 pts)¶
Compare the representation using PCA and Randomized PCA. The method you choose to compare dimensionality methods should quantitatively explain which method is better at representing the images with fewer components. Do you prefer one method over another? Why?
# compare running time and eigenimage representations of PCA and RPCA
print("PCA time:")
%time pca = PCA(n_components=325).fit(X)
print("Randomized PCA time:")
%time rpca = PCA(n_components=325, svd_solver='randomized').fit(X)
pca_eigen = pca.components_.reshape(325, 128, 128)
eigenpca_titles = ["PCA eigenimage %d" % i for i in range(pca_eigen.shape[0])]
print("PCA eigenimages:")
plot_gallery(pca_eigen, eigenpca_titles, 128, 128, 1, 5)
rpca_eigen = rpca.components_.reshape(325, 128, 128)
eigenrpca_titles = ["RPCA eigenimage %d" % i for i in range(rpca_eigen.shape[0])]
print("Randomized PCA eigenimages:")
plot_gallery(rpca_eigen, eigenrpca_titles, 128, 128, 1, 5)
PCA time:
CPU times: user 9.93 s, sys: 888 ms, total: 10.8 s
Wall time: 1.45 s
Randomized PCA time:
CPU times: user 9.95 s, sys: 669 ms, total: 10.6 s
Wall time: 1.35 s
PCA eigenimages:
Randomized PCA eigenimages:
Discussion
According to Figure 1 and Figure 2, both PCA and Randomized PCA required about 325 components to capture approximately 95% of the variance in the pistachio dataset. This dimensionality reduction greatly reduces the original 16,384 pixel features while still preserving the essential structure and variability of the images. By compressing the data into this smaller set of components, the representation becomes far more efficient for both storage and computation without significant information loss.
The results also show that both methods achieved nearly identical explained variance and produced very similar eigenimage representations. These eigenimages illustrate the basis patterns extracted from the dataset, and the visual similarity between PCA and Randomized PCA confirms that the randomized approximation preserves the same important structures as standard PCA.
Although runtime measurements can vary with each execution, Randomized PCA generally tends to be more efficient and scalable to larger datasets. Its advantage lies in achieving comparable accuracy to standard PCA while reducing computational overhead.
In summary, both PCA and Randomized PCA are strong methods for dimensionality reduction, but the combination of efficiency, scalability, and consistent eigenimage quality makes Randomized PCA the more practical choice for large-scale applications.
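One additional quantitative comparison the discussion above suggests is reconstruction error: project onto k components with each solver, reconstruct, and compare the mean squared error. A hedged sketch on synthetic low-rank data (not the pistachio matrix; `recon_mse` is a helper defined here, not part of the notebook):

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data with a clear rank-15 structure plus small noise
rng = np.random.default_rng(1)
data = rng.normal(size=(300, 15)) @ rng.normal(size=(15, 100))
data += 0.1 * rng.normal(size=(300, 100))

def recon_mse(solver, k=20):
    # fit k components, reconstruct, and measure mean squared error
    p = PCA(n_components=k, svd_solver=solver, random_state=0).fit(data)
    recon = p.inverse_transform(p.transform(data))
    return float(np.mean((data - recon) ** 2))

mse_full = recon_mse("full")
mse_rand = recon_mse("randomized")
print(mse_full, mse_rand)
```

If the two errors are nearly identical at the same k, as the explained variance figures indicate here, the randomized solver loses essentially nothing while scaling better to large feature counts.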
Part Four (1 pts)¶
Perform feature extraction upon the images using any feature extraction technique (e.g., gabor filters, ordered gradients, DAISY, etc.).
import numpy as np
import matplotlib.pyplot as plt
from skimage.filters import sobel_h, sobel_v
from skimage.feature import daisy
# assume 'images' is a numpy array of shape (N, H, W)
_, h, w = images.shape
# pick a random image to reconstruct
idx_to_reconstruct = int(np.random.rand() * len(X))
img = X[idx_to_reconstruct].reshape((h, w))
# visualize original image
plt.subplot(1, 2, 1)
plt.imshow(img, cmap='gray')
plt.title("Original Image")
plt.axis('off')
# visualize gradient magnitude
gradient_mag = np.sqrt(sobel_v(img)**2 + sobel_h(img)**2)
plt.subplot(1, 2, 2)
plt.imshow(gradient_mag, cmap='gray')
plt.title("Gradient Magnitude")
plt.axis('off')
plt.show()
# DAISY feature descriptor visualization
params = {
"step": 15,
"radius": 10,
"rings": 2,
"histograms": 6,
"orientations": 8
}
features, img_desc = daisy(img, visualize=True, **params)
plt.imshow(img_desc, cmap='gray')
plt.title("DAISY Descriptor Visualization")
plt.axis('off')
plt.show()
features = daisy(img, visualize=False, **params)
print(features.shape)
print(features.shape[0]*features.shape[1]*features.shape[2])
(8, 8, 104) 6656
def apply_daisy(row, shape):
    # compute DAISY descriptors for one image and flatten to a 1-D vector
    feat = daisy(row.reshape(shape), visualize=False, **params)
    return feat.reshape(-1)
%time test_feature = apply_daisy(X[3],(h,w))
test_feature.shape
CPU times: user 9.25 ms, sys: 2.45 ms, total: 11.7 ms
Wall time: 11.5 ms
(6656,)
# apply to entire data, row by row,
# takes about a minute to run
%time daisy_features = np.apply_along_axis(apply_daisy, 1, X, (h,w))
print(daisy_features.shape)
CPU times: user 15 s, sys: 1.16 s, total: 16.2 s
Wall time: 16.2 s
(2148, 6656)
from sklearn.metrics.pairwise import pairwise_distances
# find the pairwise distance between all the different image features
%time dist_matrix = pairwise_distances(daisy_features)
CPU times: user 4.23 s, sys: 128 ms, total: 4.36 s
Wall time: 571 ms
import numpy as np
import matplotlib.pyplot as plt
# find closest image to current image
for idx1 in range(1, 20):
    # make a copy of the distances for this row
    distances = dist_matrix[idx1, :].copy()
    distances[idx1] = np.inf  # don't pick the same image!
    # index of the closest image
    idx2 = np.argmin(distances)
    plt.figure(figsize=(7, 10))
    # original image
    plt.subplot(1, 2, 1)
    plt.imshow(X[idx1].reshape((h, w)), cmap='gray')
    plt.title("Original Image")
    plt.axis('off')  # removes gridlines for images
    # closest image
    plt.subplot(1, 2, 2)
    plt.imshow(X[idx2].reshape((h, w)), cmap='gray')
    plt.title("Closest Image")
    plt.axis('off')
    plt.show()
Part Five (2 pts)¶
Does this feature extraction method show promise for your prediction task? Why? Use visualizations to analyze this question. For example, visualize the differences between statistics of extracted features in each target class. Another option: use a heat map of the pairwise differences (ordered by class) among all extracted features. Another option: build a nearest neighbor classifier to see actual classification performance.
from sklearn.metrics import pairwise_distances
import seaborn as sns
# compute average daisy feature vector per class
classes = np.unique(y)
class_means = np.array([daisy_features[y==c].mean(axis=0) for c in classes])
# compute pairwise distances between class means
dist_matrix = pairwise_distances(class_means)
# plot heatmap
plt.figure(figsize=(8,6))
sns.heatmap(dist_matrix, xticklabels=classes, yticklabels=classes, cmap="viridis", annot=True, fmt=".2f")
plt.title("Pairwise Distances Between Class-Averaged DAISY Features")
plt.show()
The heatmap shows the pairwise distances between the average DAISY feature vectors for each class. Larger values mean the classes are more distinct in feature space, while smaller values indicate potential overlap or confusion. This visualization highlights which classes DAISY features separate most effectively and which ones may be more difficult to distinguish. In the heatmap, diagonal values are zero as expected (a class compared to itself). The off-diagonal distances are around 0.5, which indicates that DAISY features separate classes to some extent. While these distances are not extremely large, they are greater than zero, showing that the classes have measurable separation in feature space. This supports the idea that DAISY features capture useful discriminative information.
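Because the heatmap compresses each class to a single mean vector, it can hide within-class spread. A complementary check is to compare the average within-class distance against the average between-class distance: discriminative features should put between-class pairs farther apart. The sketch below demonstrates the computation on synthetic features (stand-ins for the real DAISY vectors, with a hypothetical `labels_demo` array in place of `y`):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

# Two synthetic classes of 50 samples each in a 16-D feature space
rng = np.random.default_rng(3)
feats = np.vstack([rng.normal(0.0, 1.0, size=(50, 16)),
                   rng.normal(2.0, 1.0, size=(50, 16))])
labels_demo = np.array([0] * 50 + [1] * 50)

D = pairwise_distances(feats)
eye = np.eye(len(labels_demo), dtype=bool)
same = (labels_demo[:, None] == labels_demo[None, :]) & ~eye  # same class, not self
diff = labels_demo[:, None] != labels_demo[None, :]           # different class
within, between = D[same].mean(), D[diff].mean()
print(round(within, 2), round(between, 2))
```

Running the same comparison on `daisy_features` with `y` would indicate whether the ~0.5 gap between class means also holds at the level of individual samples.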
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    daisy_features, y, test_size=0.2, random_state=42
)
# Initialize and train KNN classifier
knn_dsy = KNeighborsClassifier(n_neighbors=3)
knn_dsy.fit(X_train, y_train)
# Evaluate accuracy
acc_dsy = accuracy_score(y_test, knn_dsy.predict(X_test))
print(f"Daisy Accuracy: {100*acc_dsy:.2f}%")
Daisy Accuracy: 83.02%
Using the k-nearest-neighbor classifier to see the classification performance, Daisy gave an accuracy of 83.02%, indicating that the extracted DAISY features are promising for classification.
The heatmap of class-averaged DAISY features showed measurable differences between the two classes, indicating that the descriptors capture useful patterns that distinguish between these two categories. This conclusion is supported by the nearest neighbor classifier, which achieved an accuracy of 83.02%. Together, the heatmap analysis and classification result demonstrate that DAISY feature extraction is a promising approach for this prediction task.
Exceptional Work (1 pts)¶
Perform feature extraction upon the images using DAISY. Rather than using matching on the images with the total DAISY vector, you will instead use key point matching. You will need to investigate appropriate methods for key point matching using DAISY. NOTE: this often requires some type of brute force matching per pair of images, which can be computationally expensive. Does it perform better than not using key point matching?
from skimage.feature import daisy
import numpy as np
from sklearn.neighbors import NearestNeighbors
from tqdm import tqdm
def apply_daisy(img_row, shape):
    """
    Compute DAISY descriptors for an image and return them as a
    (num_keypoints, descriptor_length) array of flat descriptors.
    """
    feat = daisy(img_row.reshape(shape), step=5, radius=5,
                 rings=2, histograms=8, orientations=4,
                 visualize=False)
    s = feat.shape
    return feat.reshape((s[0] * s[1], s[2]))
# Step 1: Compute DAISY descriptors for all images
daisy_features_list = [apply_daisy(img, (h, w)) for img in X]
# Step 2: Flatten all descriptors and keep track of image indices
all_descriptors = np.vstack(daisy_features_list)
image_indices = np.hstack([[i]*len(d) for i, d in enumerate(daisy_features_list)])
# Step 3: Build NearestNeighbors model
nbrs = NearestNeighbors(n_neighbors=1, algorithm='auto', metric='euclidean').fit(all_descriptors)
# Step 4: Count matches between images
n_images = len(daisy_features_list)
daisy_featured_matching = np.zeros((n_images, n_images-1), dtype=int)
for i in tqdm(range(n_images)):
    descriptors_i = daisy_features_list[i]
    distances, indices = nbrs.kneighbors(descriptors_i, n_neighbors=1)
    # threshold distance for a "match" (tune depending on your data)
    threshold = 0.1
    matched_indices = indices[distances[:, 0] < threshold].ravel()
    # count matches per other image
    counts = np.zeros(n_images, dtype=int)
    for idx in matched_indices:
        img_j = image_indices[idx]
        if img_j != i:
            counts[img_j] += 1
    # remove the self-match column and store
    daisy_featured_matching[i, :] = np.delete(counts, i)
print("DAISY Matching Features Shape:", daisy_featured_matching.shape)
100%|███████████████████████████████████████| 2148/2148 [37:26<00:00, 1.05s/it]
DAISY Matching Features Shape: (2148, 2147)
We performed DAISY feature extraction with key point matching. Descriptors are computed at points sampled across each image, and matching then compares descriptors across images with a nearest neighbor search, counting descriptor pairs that fall below a distance threshold. Key point matching is far more computationally expensive and may reduce overall classification accuracy when the global context of the image matters more than local detail.
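A common refinement over the fixed distance threshold used above is Lowe's ratio test: keep a descriptor match only when its nearest neighbor is much closer than the second nearest. A minimal sketch on synthetic descriptors (random stand-ins for DAISY vectors from two images; the 0.7 ratio is a conventional choice, not tuned for this dataset):

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(2)
desc_a = rng.normal(size=(40, 64))  # descriptors from "image A"
# "image B": noisy copies of A's first 10 descriptors (true matches) + 30 unrelated ones
desc_b = np.vstack([desc_a[:10] + 0.01 * rng.normal(size=(10, 64)),
                    rng.normal(size=(30, 64))])

# for each descriptor in A, find its two nearest neighbors in B
nn = NearestNeighbors(n_neighbors=2).fit(desc_b)
dist, idx = nn.kneighbors(desc_a)

# ratio test: accept only when the best match clearly beats the runner-up
good = dist[:, 0] < 0.7 * dist[:, 1]
print(int(good.sum()))
```

A pair of images can then be scored by its count of ratio-test matches rather than by matches under an absolute distance cutoff, which is less sensitive to the overall scale of the descriptors.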
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
# Split into train and test sets
X_train, X_test, y_train, y_test = train_test_split(
    daisy_featured_matching, y, test_size=0.2, random_state=42
)
# Train KNN classifier
knn_dsy = KNeighborsClassifier(n_neighbors=3)
knn_dsy.fit(X_train, y_train)
# Evaluate accuracy
acc_dsy = accuracy_score(y_test, knn_dsy.predict(X_test))
print(f"Daisy Accuracy (keypoint matching features, optimized): {100*acc_dsy:.2f}%")
Daisy Accuracy (keypoint matching features, optimized): 75.35%
After completing key point matching, the accuracy achieved was 75.35%. Compared to the 83.02% accuracy from standard DAISY feature extraction, key point matching is not a better option, and it is also far more computationally expensive. Overall, for this dataset, DAISY feature extraction without key point matching provides better performance. This suggests that global context is important within this dataset, as opposed to only comparing key points within the images.